Review the Guns dataset from the “AER” package.

Objective: Using the collected data, identify the model that best predicts the violent crime rate.

Review the structure of the data

## 'data.frame':    1173 obs. of  13 variables:
##  $ year      : Factor w/ 23 levels "1977","1978",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ violent   : num  414 419 413 448 470 ...
##  $ murder    : num  14.2 13.3 13.2 13.2 11.9 10.6 9.2 9.4 9.8 10.1 ...
##  $ robbery   : num  96.8 99.1 109.5 132.1 126.5 ...
##  $ prisoners : int  83 94 144 141 149 183 215 243 256 267 ...
##  $ afam      : num  8.38 8.35 8.33 8.41 8.48 ...
##  $ cauc      : num  55.1 55.1 55.1 54.9 54.9 ...
##  $ male      : num  18.2 18 17.8 17.7 17.7 ...
##  $ population: num  3.78 3.83 3.87 3.9 3.92 ...
##  $ income    : num  9563 9932 9877 9541 9548 ...
##  $ density   : num  0.0746 0.0756 0.0762 0.0768 0.0772 ...
##  $ state     : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ law       : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Review data visually

The violent variable does not appear to be normally distributed

Scatterplot - Best way to show this?

## Loading required package: ggplot2

* The variables murder and robbery have the strongest positive linear relationships with the violent crime rate, with correlation values of 0.827 and 0.907, respectively
    + Because these variables are components of the violent crime rate, we expected high correlation values
    + These variables are subcategories of violent crime and therefore are not good predictors for a linear model
* Other variables that showed high correlation values were prisoners (0.703), density (0.665), afam (0.57), cauc (-0.573), and income (0.408)
    + The law variable was also a good indicator: the box plot showed that, on average, states with a shall-carry law in effect have a lower violent crime rate as well as lower murder and robbery rates

Fit Plot

* The Fit Plot shows that the actual values deviate considerably from the fitted values (pink reference line)
* The Residual plot shows that a majority of the observed values fall within +/-500 of the fitted values
    + This could be considered a substantial deviation from the fitted values
    + Next, let's examine whether a log transformation or a Box-Cox transformation would be useful

Log Transformation Fitted Plot & Transformation Residuals vs. Fitted

* The Log Transformed Fit Plot shows that the actual values deviate considerably from the fitted values (blue reference line)
* The Transformed Residuals vs. Fitted plot is easier to read than that of the previous model (m5)
    + The y-axis (.resid) is shown in z-scores, with a majority of the data falling within +/- 1 standard deviation of the mean
    + Moving forward we will use the log transformation of the violent variable to make the data easier to interpret

****Check on the m5 statement******

Box-Cox Transformation

## Estimated transformation parameter 
##        Y1 
## 0.6022324
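
As a minimal sketch of how such a power can be estimated, the example below uses `MASS::boxcox` on simulated data. This is an assumption for illustration: the report does not show which function produced the estimate above, and the variables `x`, `y`, and `lam_hat` are hypothetical stand-ins, not the Guns data.

```r
# Sketch: estimate a Box-Cox power for a positive response with MASS::boxcox.
# The data are simulated (NOT the Guns data); all names are hypothetical.
library(MASS)

set.seed(1)
x <- runif(200, 1, 10)
# Build y so that sqrt(y) is linear in x, i.e. the "true" power is about 0.5.
y <- (2 + 1.5 * x + rnorm(200, sd = 0.5))^2

# y = TRUE and qr = TRUE store the pieces boxcox() needs on the fitted object.
fit <- lm(y ~ x, y = TRUE, qr = TRUE)
bc  <- boxcox(fit, lambda = seq(-1, 2, by = 0.01), plotit = FALSE)

# The estimate is the lambda with the highest profile log-likelihood.
lam_hat <- bc$x[which.max(bc$y)]
lam_hat

# Refit on the transformed scale, mirroring lm(violent^lam ~ ., data = Guns).
fit_bc <- lm(y^lam_hat ~ x)
```

With the estimate in hand, the response is raised to that power before refitting, which is the form the later model calls in this report use (`violent^lam`).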

Should we remove this line from showing in the html?

Box-Cox Transformation Fitted Plot & Transformation Residuals vs. Fitted

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Normal QQ-plots for detecting non-normality

Although there is not much separation, the Box-Cox transformed model seems to fit the best

Histogram

* The residuals of the Box-Cox transformed model appear to have the most normally distributed histogram of the plots above

Shapiro test of Normality

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m1)
## W = 0.97023, p-value = 8.514e-15
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m2)
## W = 0.97727, p-value = 1.293e-12
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mlam)
## W = 0.98664, p-value = 6.98e-09

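All three Shapiro-Wilk tests reject normality at the 0.05 level, but the Box-Cox transformed model (mlam) has the highest W statistic (0.98664). Based on our analysis, we will use the Box-Cox transformed model.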
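
The comparison of W statistics can be sketched as follows. The data are simulated and the model names `fit_raw` and `fit_log` are hypothetical; only `shapiro.test` and `residuals` are the functions that appear in the output above.

```r
# Sketch: compare Shapiro-Wilk W statistics for the residuals of two fits.
# Simulated data (NOT the Guns data); all object names are hypothetical.
set.seed(42)
n <- 300
x <- runif(n, 0, 5)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.4))  # log(y) is linear in x by construction

fit_raw <- lm(y ~ x)        # untransformed response
fit_log <- lm(log(y) ~ x)   # log-transformed response

# W closer to 1 means the residuals look more like a normal sample.
w_stats <- c(
  raw = unname(shapiro.test(residuals(fit_raw))$statistic),
  log = unname(shapiro.test(residuals(fit_log))$statistic)
)
w_stats
```

With more than a thousand observations, as in this data set, the test has enough power to reject normality for small departures, so comparing W across candidate models can be more informative than the p-values alone.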

Stepwise Regression

## Start:  AIC=2239.81
## violent^lam ~ year + murder + robbery + prisoners + afam + cauc + 
##     male + population + income + density + state + law
## 
##              Df Sum of Sq   RSS    AIC
## - cauc        1         0  6872 2237.8
## - density     1         0  6873 2237.9
## - afam        1         6  6879 2238.9
## - population  1         7  6879 2239.0
## - income      1         8  6880 2239.2
## <none>                     6872 2239.8
## - law         1        78  6950 2251.0
## - male        1       100  6972 2254.7
## - murder      1       132  7005 2260.2
## - prisoners   1       218  7090 2274.4
## - year       22      1632  8504 2445.7
## - robbery     1      4594 11467 2838.3
## - state      50     33830 40702 4226.3
## 
## Step:  AIC=2237.81
## violent^lam ~ year + murder + robbery + prisoners + afam + male + 
##     population + income + density + state + law
## 
##              Df Sum of Sq   RSS    AIC
## - density     1         0  6873 2235.9
## - population  1         8  6880 2237.1
## - income      1         8  6881 2237.2
## <none>                     6872 2237.8
## - afam        1        16  6888 2238.5
## - law         1        78  6951 2249.1
## - murder      1       133  7006 2258.3
## - male        1       194  7066 2268.4
## - prisoners   1       231  7104 2274.6
## - year       22      1825  8697 2470.0
## - robbery     1      4595 11467 2836.4
## - state      50     33910 40782 4226.6
## 
## Step:  AIC=2235.88
## violent^lam ~ year + murder + robbery + prisoners + afam + male + 
##     population + income + state + law
## 
##              Df Sum of Sq   RSS    AIC
## - population  1         7  6880 2235.1
## - income      1         9  6882 2235.4
## <none>                     6873 2235.9
## - afam        1        23  6896 2237.8
## - law         1        79  6952 2247.3
## - murder      1       136  7009 2256.9
## - male        1       193  7066 2266.4
## - prisoners   1       390  7263 2298.6
## - year       22      1825  8698 2468.1
## - robbery     1      4594 11467 2834.4
## - state      50     43238 50111 4466.2
## 
## Step:  AIC=2235.13
## violent^lam ~ year + murder + robbery + prisoners + afam + male + 
##     income + state + law
## 
##             Df Sum of Sq   RSS    AIC
## - income     1         9  6889 2234.6
## <none>                    6880 2235.1
## - afam       1        18  6898 2236.2
## - law        1        78  6958 2246.4
## - murder     1       130  7010 2255.0
## - male       1       199  7080 2266.6
## - prisoners  1       452  7332 2307.8
## - year      22      1849  8729 2470.3
## - robbery    1      4637 11517 2837.4
## - state     50     47934 54814 4569.5
## 
## Step:  AIC=2234.63
## violent^lam ~ year + murder + robbery + prisoners + afam + male + 
##     state + law
## 
##             Df Sum of Sq   RSS    AIC
## <none>                    6889 2234.6
## - afam       1        15  6904 2235.1
## - law        1        89  6978 2247.7
## - murder     1       172  7061 2261.5
## - male       1       206  7095 2267.2
## - prisoners  1       455  7344 2307.7
## - year      22      1844  8733 2468.9
## - robbery    1      4688 11577 2841.6
## - state     50     48242 55131 4574.2

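The stepwise regression removes cauc, density, population, and income from the model, lowering the AIC from 2239.81 to 2234.63. Backward elimination is a greedy search, so this is the lowest-AIC model it found rather than a guaranteed globally optimal model.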
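
A minimal sketch of backward elimination by AIC with `step()` from base R is shown below. The data frame `dat` and its columns are simulated and hypothetical; the report's own call is not shown, so the arguments here are assumptions.

```r
# Sketch: backward elimination by AIC with stats::step().
# Simulated data (NOT the Guns data); all names are hypothetical.
set.seed(7)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 enter the response; x3 and x4 are pure noise and are
# therefore candidates for removal.
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
# trace = 0 suppresses the step-by-step tables like the ones printed above.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
# step() reports the AIC computed by extractAIC(), so compare on that scale.
c(full = extractAIC(full)[2], reduced = extractAIC(reduced)[2])
```

At each step the term whose removal lowers the AIC the most is dropped, and the search stops once `<none>` is at the top of the table, which is the pattern visible in the output above.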

Compare Coefficients of the two models

## Calls:
## 1: lm(formula = violent^lam ~ ., data = Guns)
## 2: lm(formula = violent^lam ~ year + murder + robbery + prisoners + 
##   afam + male + state + law, data = Guns)
## 
##                            Model 1  Model 2
## (Intercept)                   11.3     11.6
## year1978                     0.886    0.955
## year1979                      2.32     2.39
## year1980                      2.65     2.67
## year1981                      2.41     2.45
## year1982                      2.57     2.60
## year1983                      2.32     2.39
## year1984                      3.26     3.42
## year1985                      4.14     4.35
## year1986                      5.33     5.58
## year1987                      5.46     5.75
## year1988                      6.41     6.74
## year1989                      6.99     7.36
## year1990                      8.93     9.30
## year1991                      9.55     9.90
## year1992                      10.3     10.7
## year1993                      10.7     11.1
## year1994                      10.3     10.8
## year1995                      9.96    10.41
## year1996                      8.94     9.42
## year1997                      8.90     9.43
## year1998                      8.08     8.69
## year1999                      6.98     7.63
## murder                       0.152    0.161
## robbery                     0.0499   0.0496
## prisoners                  0.00993  0.01060
## afam                        -0.419   -0.337
## cauc                      -0.00345         
## male                          1.12     1.15
## population                   0.148         
## income                    0.000135         
## density                     -0.354         
## stateAlaska                   1.21     1.56
## stateArizona              -0.53089 -0.00311
## stateArkansas                -3.57    -3.56
## stateCalifornia             -0.905    3.402
## stateColorado                -4.31    -3.37
## stateConnecticut             -9.31    -8.10
## stateDelaware               -0.431   -0.277
## stateDistrict of Columbia    3.678   -0.974
## stateFlorida                  9.82    11.71
## stateGeorgia                 -2.05    -1.61
## stateHawaii                  -7.67    -8.70
## stateIdaho                   -11.4    -11.1
## stateIllinois                 1.08     3.01
## stateIndiana                 -7.29    -6.28
## stateIowa                    -12.8    -12.0
## stateKansas                  -7.50    -6.87
## stateKentucky                -10.5    -10.1
## stateLouisiana                4.42     4.26
## stateMaine                   -17.1    -16.5
## stateMaryland                 3.05     3.60
## stateMassachusetts           0.765    2.131
## stateMichigan               -0.824    0.603
## stateMinnesota               -13.5    -12.3
## stateMississippi             -7.39    -8.02
## stateMissouri                -2.36    -1.51
## stateMontana                 -15.6    -15.4
## stateNebraska               -10.19    -9.57
## stateNevada                  -2.67    -2.23
## stateNew Hampshire           -19.2    -18.3
## stateNew Jersey              -5.28    -3.92
## stateNew Mexico               8.06     8.09
## stateNew York               -3.505   -0.626
## stateNorth Carolina          -1.45    -0.91
## stateNorth Dakota            -25.0    -24.7
## stateOhio                    -9.26    -7.54
## stateOklahoma                -3.22    -2.97
## stateOregon                  -3.06    -2.25
## statePennsylvania           -10.26    -8.19
## stateRhode Island            -7.29    -6.98
## stateSouth Carolina           10.4     10.1
## stateSouth Dakota            -17.0    -16.7
## stateTennessee              -1.484   -0.917
## stateTexas                   -5.87    -3.51
## stateUtah                    -13.1    -12.7
## stateVermont                 -18.9    -18.4
## stateVirginia                -12.1    -11.3
## stateWashington              -4.34    -3.20
## stateWest Virginia           -15.5    -15.1
## stateWisconsin               -16.9    -15.8
## stateWyoming                 -10.7    -10.3
## lawyes                       -1.09    -1.14

The larger model (mlam) has larger standard errors than the smaller model (m3), which has gone through the stepwise regression. Note that the table above lists only the coefficient estimates, not the standard errors.

P-Value of the Partial F-test

## Analysis of Variance Table
## 
## Model 1: violent^lam ~ year + murder + robbery + prisoners + afam + male + 
##     state + law
## Model 2: violent^lam ~ year + murder + robbery + prisoners + afam + cauc + 
##     male + population + income + density + state + law
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1094 6889.0                           
## 2   1090 6872.4  4    16.557 0.6565 0.6224

Conclusion

The results produced by the anova function are not significant (p-value = 0.6224), so we fail to reject the null hypothesis that the coefficients of the four dropped variables are zero. Therefore we set aside the larger model (mlam) and move forward with the smaller model (m3).
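
The F statistic and p-value can be checked by hand from the numbers in the table. The sketch below only redoes that arithmetic; the variable names are hypothetical.

```r
# Sketch: recompute the partial F-test from the values printed in the table.
ss_diff   <- 16.557   # Sum of Sq: RSS(small) - RSS(large)
rss_large <- 6872.4   # RSS of the larger model
df_small  <- 1094     # residual df of the smaller model
df_large  <- 1090     # residual df of the larger model

q <- df_small - df_large   # number of coefficients being tested (4)

# F = (extra sum of squares / q) / (RSS_large / residual df of larger model)
f_stat <- (ss_diff / q) / (rss_large / df_large)
p_val  <- pf(f_stat, df1 = q, df2 = df_large, lower.tail = FALSE)

round(f_stat, 4)   # compare with F = 0.6565 in the table
p_val              # compare with Pr(>F) in the table
```

The 4 degrees of freedom correspond to the four terms that separate the two models: cauc, population, income, and density.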

To be completed by JH and added potentially

Checking for any influential points that are controlling the results

##         StudRes        Hat       CookD
## 185   0.4156464 0.25978614 0.000731069
## 189  -5.0794642 0.25204347 0.102420286
## 195   3.9512822 0.13445287 0.028833267
## 207   0.4733689 0.29029093 0.001105054
## 1127  4.8407145 0.06827768 0.020271504
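
The three columns above can be computed with base R, as in the sketch below. The data are simulated with one planted outlier, and the `4 / n` cutoff for Cook's distance is a common rule of thumb chosen here as an assumption, not something taken from this report.

```r
# Sketch: studentized residuals, hat values, and Cook's distance in base R.
# Simulated data (NOT the Guns data); all names are hypothetical.
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
# Plant one point with high leverage (extreme x) and a large residual.
x[n] <- 6
y[n] <- -5

fit <- lm(y ~ x)
infl <- data.frame(
  StudRes = rstudent(fit),        # externally studentized residuals
  Hat     = hatvalues(fit),       # leverage
  CookD   = cooks.distance(fit)   # combined influence on the fitted values
)

# Flag observations above the assumed rule-of-thumb cutoff, largest first.
flagged <- infl[infl$CookD > 4 / n, ]
flagged[order(-flagged$CookD), ]
```

In the table above, observation 189 combines a large studentized residual (-5.08) with a high hat value (0.25) and has the largest Cook's distance (0.102) of the five listed, so it is the natural first candidate to inspect.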

Transformation Check

* After looking at the scaled scatterplot of the data, we determined that cauc needs to be transformed

Stepwise regression with log-transformed cauc

## Start:  AIC=9240.41
## violent ~ year + murder + robbery + prisoners + afam + log(cauc) + 
##     male + population + income + density + state + law
## 
##              Df Sum of Sq      RSS     AIC
## - density     1       236  2685757  9238.5
## - log(cauc)   1       335  2685857  9238.6
## <none>                     2685522  9240.4
## - income      1      5124  2690646  9240.6
## - afam        1      6355  2691877  9241.2
## - male        1     18702  2704223  9246.5
## - population  1     24164  2709685  9248.9
## - law         1     26090  2711611  9249.7
## - prisoners   1    177230  2862751  9313.4
## - murder      1    332563  3018084  9375.3
## - year       22    502422  3187943  9397.6
## - robbery     1   3052545  5738066 10129.0
## - state      50   8884729 11570250 10853.6
## 
## Step:  AIC=9238.51
## violent ~ year + murder + robbery + prisoners + afam + log(cauc) + 
##     male + population + income + state + law
## 
##              Df Sum of Sq      RSS     AIC
## - log(cauc)   1       219  2685976  9236.6
## <none>                     2685757  9238.5
## - income      1      5544  2691301  9238.9
## - afam        1      6479  2692237  9239.3
## - male        1     18570  2704327  9244.6
## - population  1     24029  2709786  9247.0
## - law         1     26389  2712146  9248.0
## - prisoners   1    303717  2989474  9362.2
## - murder      1    336617  3022375  9375.0
## - year       22    502244  3188001  9395.6
## - robbery     1   3058367  5744124 10128.2
## - state      50   9307137 11992894 10893.7
## 
## Step:  AIC=9236.6
## violent ~ year + murder + robbery + prisoners + afam + male + 
##     population + income + state + law
## 
##              Df Sum of Sq      RSS     AIC
## <none>                     2685976  9236.6
## - income      1      5838  2691814  9237.2
## - afam        1     16775  2702751  9241.9
## - law         1     26764  2712740  9246.2
## - male        1     27232  2713208  9246.4
## - population  1     27239  2713215  9246.4
## - prisoners   1    303511  2989487  9360.2
## - murder      1    336791  3022767  9373.2
## - year       22    527892  3213868  9403.1
## - robbery     1   3136507  5822483 10142.1
## - state      50  10803349 13489325 11029.7

Interestingly, with this specification the stepwise regression removes only density and log(cauc), leaving income and population in the model.

Cross-Validation for Linear Models

Cross-Validation for Linear Models against 10 seeds each

##      mse.m1     mse.m2   mse.m3   mse.m4 mse.mlam
## 1  2889.625 0.01761853 7.150634 2890.016 7.201197
## 2  2890.438 0.01885826 7.429532 2834.517 7.645189
## 3  2738.429 0.01737850 6.972956 2742.381 6.989694
## 4  2859.606 0.01764531 7.123702 2836.695 7.216332
## 5  2712.441 0.01768070 6.987390 2706.323 7.057505
## 6  2919.177 0.01747342 7.073046 2921.903 7.136046
## 7  2933.971 0.01829054 7.059834 2922.766 7.310023
## 8  2712.441 0.01768070 6.987390 2706.323 7.057505
## 9  2919.177 0.01747342 7.073046 2921.903 7.136046
## 10 2933.971 0.01829054 7.059834 2922.766 7.310023
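
A table like the one above can be produced with a k-fold cross-validation loop repeated over several seeds. The sketch below is one way to do it in base R; the helper `cv_mse`, the choice of 10 folds, and the simulated data are all assumptions, since the report does not show its cross-validation code.

```r
# Sketch: k-fold cross-validated MSE for lm(), repeated over several seeds.
# cv_mse() is a hypothetical helper; the data are simulated (NOT the Guns data).
cv_mse <- function(formula, data, k = 10, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test <- folds == i
    fit  <- lm(formula, data = data[!test, ])
    pred <- predict(fit, newdata = data[test, ])
    # model.response() returns the response on the model's own scale,
    # so a transformed response such as log(y) is handled consistently.
    obs  <- model.response(model.frame(formula, data[test, ]))
    sq_err[test] <- (obs - pred)^2
  }
  mean(sq_err)
}

set.seed(11)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 5 + 2 * dat$x1 + 0.5 * dat$x2 + rnorm(n)

seeds <- 1:10
res <- data.frame(
  mse.small = sapply(seeds, function(s) cv_mse(y ~ x1, dat, seed = s)),
  mse.big   = sapply(seeds, function(s) cv_mse(y ~ x1 + x2, dat, seed = s))
)
res
```

Because the error is measured on each model's response scale, MSE values are comparable only between models that share the same response. That is presumably why the columns above fall into three groups of very different magnitude, so mse.m3 should be compared with mse.mlam and mse.m1 with mse.m4.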
